
feat(engine): browserGpuMode "auto" — probe-once WebGL detection with software fallback#642

Merged
jrusso1020 merged 4 commits into main from feat/auto-detect-browser-gpu
May 6, 2026

Conversation

@jrusso1020
Collaborator

What

Adds browserGpuMode: "auto" to the engine config and makes it the new CLI default for local renders.

When the host doesn't have a usable GPU (CI containers, eval rigs without GPU passthrough, dev VMs), Chrome's hardware-mode WebGL flags (--use-gl=egl/metal/d3d11) silently leave WebGL unavailable — getContext("webgl") returns null, three.js' WebGLRenderer dies, the canvas stays black. Surfaced today by Abhay's c2v-eval failing on a docker render of a hyperframes bundle that uses three.js + a custom fragment shader.

In auto mode the engine launches a tiny Chrome on first render with hardware GPU args, checks canvas.getContext("webgl") !== null, and caches the result for the lifetime of the process. Hardware GPUs keep their fast path (~0 added overhead); no-GPU hosts get SwiftShader without the user needing to know to pass --no-browser-gpu.
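The decision the probe makes is small; a minimal sketch of it, assuming a DOM-like `createElement`/`getContext` surface (the real probe runs inside the launched Chrome, and all identifiers below are illustrative, not the engine source):

```typescript
// Illustrative sketch of the probe's decision logic. The minimal DOM-ish
// types are assumptions for self-containment; in the engine this check
// is evaluated inside a throwaway Chrome launched with hardware GPU args.
type CanvasLike = { getContext(id: string): unknown };
type DocumentLike = { createElement(tag: string): CanvasLike };

function probeWebGlSketch(doc: DocumentLike): "hardware" | "software" {
  try {
    const canvas = doc.createElement("canvas");
    // A WebGL context under hardware flags means hardware is usable.
    return canvas.getContext("webgl") !== null ? "hardware" : "software";
  } catch {
    // Missing canvas API or any other failure falls back to software,
    // the always-safe answer.
    return "software";
  }
}
```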

Also adds --enable-unsafe-swiftshader to the software flag set — Chrome 120+ deprecated implicit SwiftShader fallback and prints a deprecation warning unless the flag is set explicitly. Despite the name, this is exactly the pre-120 behaviour; the rename is about Chrome's threat model on the open web, not about the rendering itself.
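For reference, the software flag set described here amounts to something like the following (assembled from the flags named in this PR; the actual `getBrowserGpuArgs` may emit more):

```typescript
// Assembled from the flags named in this PR; the real getBrowserGpuArgs
// may differ. SwiftShader renders WebGL on the CPU, pixel-identical to
// hardware output but slower.
const softwareGpuArgs: string[] = [
  "--use-gl=angle",
  "--use-angle=swiftshader",
  // Chrome 120+ requires explicit opt-in to the SwiftShader fallback;
  // "unsafe" refers to Chrome's open-web threat model, not correctness.
  "--enable-unsafe-swiftshader",
];
```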

Why

The render path for SwiftShader has existed for a while in the engine — it's already what the CLI's --no-browser-gpu flag flips to, and what Docker mode forces unconditionally. The gap was discovery: users had to know that on a no-GPU host they need to pass that flag, otherwise WebGL-backed compositions silently render black frames.

Concretely, this came up when Abhay's evaluation harness — which renders user-submitted hyperframes code in a CPU-only docker — couldn't render scenes that used three.js or custom shaders, even though the engine could fundamentally do it. He had two unattractive options: provision a GPU node (expensive), or have users pre-render their videos (gameable). With auto the eval container Just Works, and so does anyone else running on a no-GPU host.

How

Engine (@hyperframes/engine):

  • EngineConfig.browserGpuMode extended to "software" | "hardware" | "auto"
  • New resolveBrowserGpuMode(mode, options) in services/browserManager.ts: pure pass-through for "software" / "hardware"; for "auto" it launches a tiny Chrome with the platform's hardware GPU args, checks WebGL availability, caches the result for the process. Probe failures (Chrome launch error, missing canvas API, etc.) fall back to "software" — the always-safe answer.
  • services/frameCapture.ts calls resolveBrowserGpuMode before buildChromeArgs so the resolved mode flows through.
  • Default engine config stays "software" (conservative for non-CLI embedders); the "auto" default lives in the CLI to avoid surprising producer/parity-harness embedders with a probe-on-launch.
  • getBrowserGpuArgs("software") now also emits --enable-unsafe-swiftshader to silence Chrome 120+'s deprecation warning.
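
The resolution flow the bullets above describe can be sketched as follows (`probeFn` stands in for the real Chrome-launching probe; all names here are illustrative, not the engine source):

```typescript
type GpuMode = "software" | "hardware" | "auto";
type ResolvedGpuMode = Exclude<GpuMode, "auto">;

// Process-lifetime cache, mirroring the behaviour described above.
let cachedResolution: ResolvedGpuMode | undefined;

async function resolveBrowserGpuModeSketch(
  mode: GpuMode,
  probeFn: () => Promise<boolean>, // true means WebGL came up under hardware args
): Promise<ResolvedGpuMode> {
  if (mode !== "auto") return mode; // pure pass-through for explicit modes
  if (cachedResolution !== undefined) return cachedResolution; // probe once per process
  try {
    cachedResolution = (await probeFn()) ? "hardware" : "software";
  } catch {
    cachedResolution = "software"; // any probe failure: the always-safe answer
  }
  return cachedResolution;
}
```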

CLI (@hyperframes/cli):

  • resolveBrowserGpuForCli now returns the tri-state "auto" | "hardware" | "software". Priority unchanged: docker forces software → explicit flag → env var → default (now "auto" instead of true).
  • RenderOptions gains browserGpuMode? (tri-state for local) alongside the existing browserGpu? (boolean for docker run-args). Backwards-compat shim: callers that still pass browserGpu: boolean get the old behaviour.
  • --browser-gpu description updated to reflect the new default.
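
The priority chain in the first bullet, as a sketch (names and option shapes are illustrative, not the actual resolveBrowserGpuForCli):

```typescript
type CliGpuMode = "auto" | "hardware" | "software";

function resolveBrowserGpuForCliSketch(opts: {
  docker?: boolean;                // --docker
  flag?: "hardware" | "software";  // --browser-gpu / --no-browser-gpu
  env?: string;                    // PRODUCER_BROWSER_GPU_MODE
}): CliGpuMode {
  if (opts.docker) return "software"; // docker always forces software
  if (opts.flag) return opts.flag;    // explicit CLI flag wins next
  if (opts.env === "auto" || opts.env === "hardware" || opts.env === "software") {
    return opts.env;                  // then the env var
  }
  return "auto";                      // new local default
}
```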

Behaviour matrix:

| Invocation | Mode resolved |
| --- | --- |
| `hyperframes render` (no flags) | `auto` (NEW) |
| `hyperframes render --browser-gpu` | `hardware` |
| `hyperframes render --no-browser-gpu` | `software` |
| `hyperframes render --docker` | `software` (unchanged) |
| `PRODUCER_BROWSER_GPU_MODE=hardware` | `hardware` |
| `PRODUCER_BROWSER_GPU_MODE=software` | `software` |
| `PRODUCER_BROWSER_GPU_MODE=auto` | `auto` |

Probe cost: ~1-2 s on first render in a worker, cached at zero cost thereafter. Misclassification toward software is the safe failure mode — SwiftShader always renders correctly, just slower; misclassifying toward hardware would have caused render-time failures.

Test plan

  • bun run --filter @hyperframes/engine test — 535/535 pass (incl. 4 new tests on resolveBrowserGpuMode covering pass-through, probe failure, caching, cache reset)
  • bun run --filter @hyperframes/cli test — 256/256 pass (incl. new test for browserGpuMode: "auto" flowing into producerConfig, plus updated resolveBrowserGpuForCli test for the tri-state)
  • bun run --filter '*' typecheck — clean across all 7 packages
  • bunx oxfmt --check + bunx oxlint clean on all 8 touched files
  • Lefthook pre-commit format/lint/typecheck — green
  • Empirical probe on this no-GPU devbox: auto → "software" in 240 ms, cached 0 ms on subsequent calls
  • Manual: ran hyperframes snapshot against Abhay's bundle on the same devbox; renders correctly via SwiftShader (verified earlier in the Slack thread context)

Notes

  • Auto-probe runs in the same process, not a subprocess. If a worker pool spawns multiple workers, each launches one probe — acceptable since the probe is cheap and the answer can legitimately differ across workers (different chrome paths / env).
  • The probe uses the platform's hardware ANGLE backend for the probe itself, then getBrowserGpuArgs("hardware") for the actual render args. Same flag set in both — what we measure is what we use.
  • _autoBrowserGpuModeCache and _resetAutoBrowserGpuModeCacheForTests are intentionally exported with _ prefix as test-only escape hatches; they're not part of the public surface.

— Rames Jusso

Collaborator

@vanceingalls vanceingalls left a comment


Verdict: comment — close to mergeable, but a couple of important issues worth addressing first.

The design is right. Resolving auto to a concrete "software" | "hardware" once per process, failing safe to software on any probe error, mirroring the real render args in the probe, and keeping the engine-config default at "software" while flipping CLI to "auto" — all the right calls. Tests cover pass-through, probe failure, caching, and the CLI tri-state. Two real issues, one stale-doc miss, and a few nits.

Important

  • packages/engine/src/services/browserManager.ts, resolveBrowserGpuMode — concurrent-probe race. parallelCoordinator.executeWorkerTask fans out N workers via Promise.all in the same process (see parallelCoordinator.ts:259, tasks.map(...) then Promise.all). Each worker calls createCaptureSession → resolveBrowserGpuMode. While _autoBrowserGpuModeCache === undefined, every worker hits the cache !== undefined guard, sees undefined, and launches its own probe Chrome. On a no-GPU host running --workers 4, that's 4 simultaneous Chrome launches fighting for CPU/RAM at startup — exactly the hosts where Chrome launch is slowest and most fragile. The PR body's "~1-2 s on first render" assumes one probe; under contention the slowest of 4 concurrent launches is materially worse. Fix is small: cache the promise, not the result —

    let _autoBrowserGpuModeProbe: Promise<"software" | "hardware"> | undefined;

    First call assigns, subsequent calls await the same promise. The PR's Notes section says "If a worker pool spawns multiple workers each launches one probe — acceptable since the probe is cheap." That reasoning only holds for separate-process worker pools; the engine's parallelCoordinator is in-process, so the assumption doesn't apply here. Worth fixing before this lands.

  • packages/cli/src/docs/rendering.md (lines 23, 29) — docs are now wrong. Line 23: "enabled by default for local renders". Line 29: "Local renders use browser GPU capture automatically". After this PR the local default is auto (probe-then-decide), not "browser GPU on." Both lines need to mention the new behaviour and the auto/hardware/software env value. Stale-by-omission docs are a real footgun on a flag that flips Chrome's rendering backend.

  • packages/engine/src/services/browserManager.ts, resolveBrowserGpuMode — no observability on which path was picked. The PR body itself flags "misclassification toward software is the safe failure mode" — which means production will silently pick software when hardware would have worked, and we'll have no log line to diagnose it. One console.info("[BrowserManager] auto browser GPU resolved to '<mode>' (probed in Xms)") (or gated on config.debug) is enough to answer "is the auto-probe actually firing on $host?" from logs. As-is, the only signal is "renders feel slow." Add this before merge.

Nits

  • packages/engine/src/services/browserManager.ts:339, getBrowserGpuArgs parameter type. With "auto" now in EngineConfig["browserGpuMode"], the function silently accepts a value the caller is supposed to have already collapsed. The defensive if (mode === "auto") branch (lines around 433–437 of the new code) papers over a type-system gap. Cleaner: tighten the param to mode: "software" | "hardware" and require callers to pass a resolved mode. The "should not reach here" becomes a compile-time guarantee instead of a runtime fallback.

  • packages/engine/src/services/browserManager.ts, _autoBrowserGpuModeCache exported as let. export let _autoBrowserGpuModeCache plus _resetAutoBrowserGpuModeCacheForTests is a fine test-escape-hatch pattern, but the underscore is convention-only — anything in the workspace can import { _autoBrowserGpuModeCache } and read it. Either keep it module-private and only export the reset (tests don't actually need to inspect the cache), or move both behind a __test__ namespace that's lint-flagged for non-test use.

  • packages/engine/src/services/browserManager.ts — probe could be tighter than getContext("webgl") !== null. On some Linux VMs --use-gl=egl returns a context that's actually backed by software (Mesa llvmpipe / SwiftShader-via-ANGLE). The probe says "hardware" but the real render is no faster than software — and we paid a Chrome launch for nothing. A read of gl.getParameter(gl.UNMASKED_RENDERER_WEBGL) (via the WEBGL_debug_renderer_info extension) and rejecting /SwiftShader|llvmpipe|Software/i would tighten this. Edge case, optional.

  • packages/engine/src/services/browserManager.ts, --ignore-gpu-blocklist in the probe. The probe sets it; the real render args (via buildChromeArgs) also set it (line 286). Consistent — but worth confirming explicitly that any future change to buildChromeArgs keeps the probe and render args in sync. A single getBaseChromeArgs() helper shared by both call sites would prevent the two from drifting.
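
The tighter check suggested in the third nit would look roughly like this (the regex is the one quoted above; the renderer strings in the test and the function name are illustrative):

```typescript
// Rejects WebGL contexts whose unmasked renderer string reveals a
// software rasterizer. In the real probe the renderer string would come
// from gl.getParameter(ext.UNMASKED_RENDERER_WEBGL) after enabling the
// WEBGL_debug_renderer_info extension.
function isSoftwareBackedRenderer(rendererString: string): boolean {
  return /SwiftShader|llvmpipe|Software/i.test(rendererString);
}
```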

Praise

  • The fail-safe-to-software policy on any probe error (Chrome launch, missing canvas API, navigation, etc.) is exactly right. Misclassifying toward software is recoverable (slow render); misclassifying toward hardware is a render-time failure. The PR commits to the correct default direction.
  • Engine-config default stays "software", CLI default flips to "auto". That correctly draws the line between embedders (parity-harness, producer's eval pipeline) who don't want surprise probe-on-launch, and CLI users who want it Just Work. Good separation of concerns.
  • --enable-unsafe-swiftshader rationale in the comment block in getBrowserGpuArgs is the right amount of "future engineer reading this won't have to re-derive why we set a flag named unsafe". Keep that level of comment everywhere.
  • PRODUCER_BROWSER_GPU_MODE=auto is now first-class in both the engine config (config.ts:177) and the CLI resolver, with symmetric handling. Nice.

— Vai

Collaborator

@miguel-heygen miguel-heygen left a comment


Solid feature — auto-detection is the right call for making CPU-only hosts Just Work. The probe → cache → fallback architecture is clean. A few items to address (echoing + extending Vai's review):

1. Concurrent probe race (Important)
_autoBrowserGpuModeCache stores the resolved string, not the Promise. With --workers 4 on a CPU-only host, all 4 workers call resolveBrowserGpuMode("auto") near-simultaneously. All see undefined cache, all launch Chrome probes. Wasteful (4x the ~2s probe cost) and nondeterministic timing.

Fix: cache the Promise<"software" | "hardware"> itself:

let _autoBrowserGpuModePromise: Promise<"software" | "hardware"> | undefined;

export async function resolveBrowserGpuMode(mode, options) {
  if (mode !== "auto") return mode;
  if (!_autoBrowserGpuModePromise) {
    _autoBrowserGpuModePromise = probeWebGLAvailability(options);
  }
  return _autoBrowserGpuModePromise;
}

First caller creates the promise, concurrent callers await the same one. Zero extra launches.

2. Stale docs (Important)
packages/cli/src/docs/rendering.md lines 23 and 29 still say "enabled by default for local renders" — should say "auto-detected" to match the new behavior. Same PR, easy fix.

3. Log the resolved mode (Important)
Add a console.log (or whatever the logger is) when auto-mode resolves. Without it, "always falls back to software even with GPU present" would be invisible in production logs. One line:

console.log(`[engine] auto GPU probe: ${_autoBrowserGpuModeCache}`);

4. getBrowserGpuArgs("auto") defensive fallback (Nit)
The defensive "auto" case in getBrowserGpuArgs is good, but worth a console.warn since reaching it means the resolution step was skipped — a logic bug, not a runtime condition.

Everything else looks correct: fail-safe-to-software on any probe error, correct embedder/CLI split (engine defaults software, CLI defaults auto), --enable-unsafe-swiftshader comment is clear, test coverage maps onto the new behavior.

Collaborator

@miguel-heygen miguel-heygen left a comment


All three important items cleanly addressed in 076b6144:

  1. Concurrent probe race — cache is now Promise<"software" | "hardware">, not the resolved string. The p1 === p2 === p3 reference equality test proves dedup at the API level. Correct.
  2. Stale docs — rendering.md lines 23 + 29 updated to describe the auto/hardware/software trichotomy.
  3. Log line: [hyperframes] browserGpuMode auto → <mode> (<reason>) on stderr, once per process. Reason includes "WebGL probe succeeded" / "WebGL unavailable" / "probe failed (...)" / "puppeteer unavailable". Cache hits don't re-log. Good.

The IIFE pattern for the cached promise (_autoBrowserGpuModeCache = (async () => { ... })()) is clean — captures the options from the first caller and deduplicates all concurrent awaits.
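
The shape being described, as a minimal sketch (illustrative names, not the merged code):

```typescript
let cachedProbe: Promise<"software" | "hardware"> | undefined;

// The first caller's async IIFE starts the probe immediately and becomes
// the cached promise; every concurrent caller awaits the same in-flight
// promise, so only one probe ever runs per process.
function resolveAutoSketch(
  probe: () => Promise<boolean>, // true means WebGL is available
): Promise<"software" | "hardware"> {
  if (!cachedProbe) {
    cachedProbe = (async () => ((await probe()) ? "hardware" : "software"))();
  }
  return cachedProbe;
}
```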

536/536 engine + 256/256 CLI. LGTM.

Collaborator

@vanceingalls vanceingalls left a comment


Re-verdict: Approve.

All 3 important items from my prior review cleanly resolved in 076b6144:

  • #1 Concurrent-probe race — fixed at the root. Cache type is now Promise<"software" | "hardware"> | undefined so the first caller's in-flight Promise is what subsequent concurrent callers await. The new regression test asserts reference equality across simultaneous probe calls (p1 === p2 === p3) — that's the exact contract pinned at the API level, not just "same eventual result". Single Chrome launch on --workers 4 on a no-GPU host now, instead of N.
  • #2 Stale rendering.md — lines 23 + 29 updated to describe the auto / hardware / software trichotomy explicitly. User-visible contract now matches the code.
  • #3 Resolved-mode log line: [hyperframes] browserGpuMode auto → <mode> (<reason>) emitted once per process when the probe resolves. Reason field is specific (WebGL probe succeeded / WebGL unavailable / probe failed (...) / puppeteer unavailable), so a regression to silent-software-fallback would be diagnosable from production logs immediately. Cache hits correctly don't re-log.

Verification claims (engine 536/536, CLI 256/256, lint/format/typecheck clean) match the test-file expansion in the diff. Skipping the 4 nits per the team's "cheap-only" preference is fine.

Three rounds, three clean responses. Shipping.

— Vai

jrusso1020 added 4 commits May 6, 2026 17:33
…software

When the host doesn't have a usable GPU (CI containers, eval rigs without
GPU passthrough, dev VMs), Chrome's hardware-mode WebGL flags
(`--use-gl=egl/metal/d3d11`) silently leave WebGL unavailable —
`getContext("webgl")` returns null, three.js' WebGLRenderer dies, the
canvas stays black. Surfaced today by Abhay's c2v-eval failing on a
docker render of an hf bundle that uses three.js + a custom fragment
shader.

The fix that's been there: `--use-gl=angle --use-angle=swiftshader` (CPU
software WebGL, ~5-50× slower but pixel-identical). The engine already
exposed `browserGpuMode: "software"` for this. The gap was discovery —
users had to know to pass `--no-browser-gpu` on no-GPU hosts.

This change adds `browserGpuMode: "auto"` (now the CLI default for local
renders): on first launch in the process, probe Chrome with hardware
args, check `canvas.getContext("webgl") !== null`, cache the result.
~1-2 s on first render, free on every subsequent render in the same
worker. Hardware GPUs keep their fast path; no-GPU hosts get SwiftShader
without ceremony.

Behaviour matrix:
- No flag, no env, local       → "auto" (NEW default)
- `--browser-gpu`              → "hardware" (force; errors if no GPU)
- `--no-browser-gpu`           → "software" (force SwiftShader)
- `PRODUCER_BROWSER_GPU_MODE`  → "hardware" / "software" / "auto" / unset
- Docker mode                  → forced "software" (unchanged)

Engine-config default stays "software" (conservative for embedders); the
"auto" default lives in the CLI's `resolveBrowserGpuForCli` so producer
embedders aren't surprised by a probe-on-launch.

Also adds `--enable-unsafe-swiftshader` to the software flag set —
Chrome 120+ deprecated implicit SwiftShader fallback and emits a
deprecation warning unless the flag is set explicitly. Despite the
"unsafe" name this is exactly the pre-deprecation behaviour; the rename
is about Chrome's threat model on the open web, not about the rendering
itself.

Verification:
- Engine 535/535 + CLI 256/256 (incl. new probe tests + tri-state CLI test)
- Empirical: probe on this no-GPU devbox returns "software" in 240 ms,
  cached 0 ms on subsequent calls
- Format / lint / typecheck clean across all packages

Refs the Abhay/Slack thread on c2v-eval rendering without a GPU node.

Initial PR carried a backwards-compat shim where RenderOptions had both
`browserGpu?: boolean` (for docker) and `browserGpuMode?` (for local).
Since renderLocal/renderDocker have no external callers, simplify to a
single field. The boolean → docker-args conversion now happens inline at
the one site that needs it (`browserGpu: options.browserGpuMode === "hardware"`
when handing off to dockerRunArgs).

No behaviour change. 535/535 engine + 256/256 CLI still pass.

Three follow-ups from Vai's staff-eng review:

1. Concurrent-probe race (real bug): the parallel coordinator runs N
   workers via Promise.all, so `--workers 4` on a no-GPU host fired 4
   simultaneous probe Chromes — each paying the same 240 ms launch cost.
   Cache the *Promise* (not the resolved value): first caller assigns
   the in-flight Promise, every other concurrent caller awaits the same
   one. Verified with a new test asserting all concurrent callers get
   the identical Promise reference.

2. Stale rendering.md (lines 23, 29): user-visible contract said
   "browser GPU enabled by default", which was wrong post-auto. Now
   describes the auto / hardware / software trichotomy explicitly.

3. Silent fallback: auto-mode produced no output, so a regression to
   "always falls back to software even with GPU present" would have
   been invisible in production logs. Added a single stderr line per
   process when the probe resolves: `[hyperframes] browserGpuMode auto
   → <mode> (<reason>)`. Cache hits don't re-log.

Verification:
- Engine 536/536 (incl. new concurrent-dedup test asserting Promise
  reference equality across simultaneous callers)
- CLI 256/256
- Format / lint / typecheck clean

Two files on main fail `bun run format:check`:
- `registry/registry.json` (missing trailing newline)
- `registry/blocks/vfx-liquid-glass/vfx-liquid-glass.html` (whitespace
  + quote-style)

Local format-check on the auto-detect-browser-gpu branch only flagged
these once rebased onto current main (post-#647 / v0.5.2). Pure
whitespace fix; no semantic change.
@jrusso1020 jrusso1020 force-pushed the feat/auto-detect-browser-gpu branch from 076b614 to 37e3b36 Compare May 6, 2026 17:34
@jrusso1020 jrusso1020 merged commit 614f764 into main May 6, 2026
38 checks passed
@jrusso1020 jrusso1020 deleted the feat/auto-detect-browser-gpu branch May 6, 2026 21:50
pull Bot pushed a commit to bhardwajRahul/hyperframes that referenced this pull request May 7, 2026
…dows CI

The first dynamic `await import("./render.js")` cold-load takes >5 s on
Windows runners — long enough to blow vitest's default 5 s timeout in
whichever test ran it first. Subsequent imports are <10 ms because the
module is now cached, so only test #1 ever times out.

The downstream failure is more subtle: when test #1 times out, vitest
moves on, but its leaked async function eventually hits the synchronous
`producer.createRenderJob(...)` line and pushes a stale config to
`producerState.createdJobs`. That push lands AFTER test #2's `beforeEach`
clears the array, so test #2's `createdJobs[0]` is the leaked test #1
entry instead of its own. That's why test #2 saw `browserGpuMode: 'software'`
when it expected `'auto'`.

Hoist the import into `beforeAll` (matching the pattern the existing
`parseVariablesArg` and `validateVariablesAgainstProject` describe blocks
in this file already use). Cold-load happens once outside any test's
timeout window, every test stays fast, no leaked promise can corrupt
state.

Failing run: https://github.com/heygen-com/hyperframes/actions/runs/25470257972/job/74732502915
Started failing on main with the merge of heygen-com#642 (auto-detect-browser-gpu),
which added the "forwards browserGpuMode='auto'" test as test #2.
